Optimization in general


This is about optimizing CPU-intensive computer programs. Although optimization of computer code depends on the actual CPU at hand, there are some general principles which, when followed, result in code that performs well on most, if not all, modern processors.

In order to get some feeling for the important aspects of producing efficient programs, one should keep in mind the common characteristics of modern processors: fast registers and caches in front of a comparatively slow main memory, and pipelined functional units that perform best on simple, regular loops.

How to make efficient use of modern hardware

First, one should evaluate the possibility of using vendor libraries. Because vendors like to advertise their products as the fastest available, they generally take care that optimized numerical libraries tuned for their own processors are available. Typically these include optimized LAPACK and BLAS subroutines, as well as Fast Fourier Transform routines. When such libraries are not available, the freely available ATLAS package can be used: when installed, it determines what kind of hardware it is running on and, based on the characteristics found, generates optimal code for the BLAS and some of the LAPACK subroutines. Often, however, it is not easy to transform a problem into standard linear algebra and Fourier transforms. In those cases it is necessary to write the code yourself; roughly, the steps described in the following sections are to be taken.

Coding for efficient cache usage

Register and cache usage is very important for the efficiency of a program. To use registers and cache efficiently, some global knowledge about the way compilers generate code is necessary. When the compiler detects a loop, the loop variables are in general placed in registers, which allows for very fast access. Also, terms that do not change value during the loop are calculated beforehand, and the result is placed in a register for efficient use. In general, compilers will perform the optimizations one could do by hand at a first glance at the loop. So it is important that loops are recognizable to the compiler, and that the structure of the data access is apparent. For example, a Fortran construction like

      do i=1,m
        do j=1,n
          c(i,j)=0
          do k=1,p
            c(i,j) = c(i,j)+a(i,k)*b(k,j)
          enddo
        enddo
      enddo
c Note: this is not an optimal piece of code!

will be optimized by a modern compiler. It is even possible that the compiler recognizes this as a matrix multiplication and generates very efficient code that was specially developed for this purpose. In general the compiler will recognize that the address of c(i,j) is constant in the k-loop, so that c(i,j) can be placed in a register; only after the end of the k-loop does c(i,j) have to be written to memory. The loop variables i, j and k are kept in registers. Only after the end of the i-loop would the end values + 1 have to be written to memory (however, following the Fortran standard, a loop variable is undefined after completion of a loop, so even this step is not necessary). One could have coded the example as follows:

      i=0
10    i=i+1
      if (i .gt. m) goto 100
      j=0
20    j=j+1
      if (j .gt. n) goto 10
      k=0
      x=0.0
30    k=k+1
      if (k .gt. p) goto 80
      x = x+a(i,k)*b(k,j)
      goto 30
80    c(i,j)=x
      goto 20
100   continue

Obviously, this code is much less clear to the human eye, and one can also expect that a compiler will not recognize it as a three-deep nested loop. So, depending on the sophistication of the compiler, it is possible that in this case no loop is recognized at all, and that the compiler performs only some very basic optimizations. In general one should also group loop-invariant subexpressions together, so that the compiler can precompute them. For example, write:

for (i=0; i<n; i++)
    a[i] = a[i]+(x+y); /* x and y loop-independent */

and not:

for (i=0; i<n; i++)
    a[i] = x+a[i]+y; /* x and y loop-independent */

Even if the compiler notices in the second case that x+y is constant in the loop, it may decide not to take advantage of that fact, because adding numbers in a different sequence can yield different results when dealing with floating-point representations of limited precision.